ChatGPT's Ability to Assess Quality and Readability of Online Medical Information: Evidence From a Cross-Sectional Study

Introduction: Artificial intelligence (AI) platforms have gained widespread attention for their distinct ability to generate automated responses to various prompts. However, their role in assessing the quality and readability of a provided text remains unclear. Thus, the purpose of this study is to evaluate the proficiency of the conversational generative pre-trained transformer (ChatGPT) in utilizing the DISCERN tool to evaluate the quality of online content regarding shock wave therapy for erectile dysfunction.
Methods: Websites were generated using a Google search of “shock wave therapy for erectile dysfunction” with location filters disabled. Readability was analyzed using Readable software (Readable.com, Horsham, United Kingdom). Quality was assessed independently by three reviewers using the DISCERN tool. The same plain text files were then inputted into ChatGPT to determine whether it produced comparable metrics for readability and quality.
Results: The readability scores produced by ChatGPT differed significantly from those obtained with an established tool, Readable.com (p<0.05). This indicates a lack of alignment between ChatGPT's algorithm and that of established tools, such as Readable.com. Similarly, the DISCERN scores generated by ChatGPT differed significantly from the scores generated manually by human evaluators (p<0.05), suggesting that ChatGPT may not be capable of accurately identifying poor-quality information sources regarding shock wave therapy as a treatment for erectile dysfunction.
Conclusion: ChatGPT’s evaluation of the quality and readability of online text regarding shock wave therapy for erectile dysfunction differs from that of human raters and trusted tools. Therefore, ChatGPT's current capabilities were not sufficient for reliably assessing the quality and readability of textual content. Further research is needed to elucidate the role of AI in the objective evaluation of online medical content in other fields. Continued development in AI and incorporation of tools such as DISCERN into AI software may enhance the way patients navigate the web in search of high-quality medical content in the future.


Introduction
As technology advances at a rapid pace, the importance of health literacy is increasing, as patients must comprehend complex medical information to make informed decisions about their health. Inadequate health literacy is strongly associated with social determinants of health, such as employment status, lifetime income, and education [1,2]. The readability of medical content, defined as the ease with which written materials can be understood, is paramount to breaking barriers among all patient populations. Poor readability of medical content can have detrimental effects, such as patient misunderstanding, and may subsequently impact treatment decisions. Recent literature suggests that online information and social media content regarding men's health is of poor quality [3,4]. Our group previously assessed the readability and quality of online content regarding shockwave therapy for erectile dysfunction using Readable software (Readable.com, Horsham, United Kingdom) and the DISCERN instrument, respectively [3,5,6]. We found that content generated from private medical practices was as readable as articles from universities or news media, but of lower scientific quality, as evidenced by lower DISCERN scores [3].
Artificial intelligence (AI) is the capability of computer systems to carry out tasks that conventionally require human intelligence, such as visual perception and decision-making. OpenAI, an AI research laboratory, developed the large language model conversational generative pre-trained transformer (ChatGPT) for public use, which has been trained on tremendous amounts of text data and employs deep learning algorithms. ChatGPT imitates and interprets human language with high precision and has been demonstrated to perform well for a variety of broad purposes, including performance on standardized examinations, such as the USMLE Step 1, research queries, drug development, and discovery [7][8][9][10][11][12][13].
AI is rapidly being integrated into web search. Free AI web-assistant tools such as Monica and HARPA AI have been developed to streamline the process of web browsing. These tools integrate ChatGPT into the web browsing sidebar, providing users with summaries of webpages in a simplified format after conducting an initial Google search. Although these tools enhance efficiency and enable users to access a larger volume of information in less time, the ability of ChatGPT to evaluate the readability and quality of medical text and detect biases remains uncertain. Evaluating the ability of ChatGPT to determine the quality of online medical content is imperative now and will become more important as patients increasingly rely on the internet and social media for medical information.
Both Readable software and the DISCERN tool are established and trusted standards in health communication [3,5,6]. Widely used and validated by experts, they offer a reliable framework for evaluating medical text readability and quality [3,5,6]. To our knowledge, no previous research has been conducted evaluating ChatGPT's proficiency in utilizing Readable software and the DISCERN tool to appraise medical text readability and quality. Thus, the aim of this study was to feed online medical text previously collected from our group's prior study into ChatGPT and examine whether it produced comparable outcomes for both readability and quality. We hypothesized that ChatGPT's findings would mirror those of our prior investigation given its high performance across a multitude of tasks in various sectors.

Materials And Methods
In our previous study, we performed a Google search of "shock wave therapy for erectile dysfunction" with location filters disabled [3]. All websites containing articles on the first page of Google were copied and downloaded as plain text files. Of the 10 websites on the first page of the Google search, one website was excluded since it led to a scientific article. Readability was evaluated using Readable software (Readable.com, Horsham, United Kingdom). Quality of content was scored independently by three authors using the DISCERN questionnaire tool, which investigates the quality of information regarding treatment options for diseases or conditions. The same plain text files collected in our previous study were used in the current study but were instead entered into ChatGPT. ChatGPT Mar23 version was used on May 2, 2023. Quality was assessed by instructing ChatGPT to answer the 16 questions from the DISCERN tool (which had been done manually by three authors in our previous study [3]). We also instructed ChatGPT to assess the readability of the entered text. Each prompt was entered three times, each time in a new chat.

Instructions for readability assessment
The following was inputted into ChatGPT to assess the readability of texts: "Using the article copied, rate the readability of the article using the following scales: Flesch-Kincaid level, Gunningfox index, Coleman-Liau index, SMOG index, Automated readability index, FORCAST grade level, Flesch reading ease."
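Several of the indices named in this prompt are closed-form functions of sentence, word, and syllable counts. As an illustration only, the sketch below computes two of them, the Flesch reading ease and the Flesch-Kincaid grade level, from their published formulas. The tokenization rules, the vowel-group syllable heuristic, and all function names are assumptions made for this sketch; they do not represent the procedures used internally by Readable.com or ChatGPT, which are not described in this study.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels.

    This heuristic is an assumption of the sketch, not the counter
    used by any particular readability tool.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """Higher values indicate easier text."""
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate US school grade needed to read the text."""
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


# Hypothetical sample text, not taken from the study's articles.
sample = "The cat sat on the mat. The dog ran."
print(round(flesch_reading_ease(sample), 2))
print(round(flesch_kincaid_grade(sample), 2))
```

Because both formulas depend directly on how sentences are split and syllables are counted, two implementations can report different scores for the same text even when they apply the same formula.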

Instructions for quality assessment
The following was inputted into ChatGPT to assess the quality of texts:

Statistics
The collected DISCERN scores from the three independent reviewers were assessed to ensure that no significant differences existed between raters, and the scores were then averaged.
The ChatGPT-calculated DISCERN scores were compared to our previously determined findings assessed by three human reviewers using a Student's t-test. In addition, the ChatGPT-calculated readability scores were compared to our previously determined findings assessed by Readable.com.
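The comparison described above can be sketched as follows. This is a minimal illustration that assumes the independent two-sample, pooled-variance form of Student's t-test; the text does not state whether a paired or unpaired test was used. All scores below are hypothetical placeholders rather than study data, and the helper names `average_raters` and `student_t` are invented for this example.

```python
from math import sqrt
from statistics import mean, variance


def average_raters(ratings_by_rater):
    """Average the reviewers' scores item by item."""
    return [mean(item) for item in zip(*ratings_by_rater)]


def student_t(a, b):
    """Pooled-variance two-sample t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# HYPOTHETICAL placeholder scores (1-5 scale) for one DISCERN question
# across five articles; these are NOT the values from the study.
reviewers = [
    [1.0, 3.0, 2.0, 3.0, 2.0],  # reviewer 1
    [2.0, 3.0, 2.0, 3.0, 2.0],  # reviewer 2
    [3.0, 3.0, 2.0, 3.0, 2.0],  # reviewer 3
]
human_mean = average_raters(reviewers)
chatgpt_scores = [4.0, 4.0, 3.0, 5.0, 4.0]

t_stat, dof = student_t(chatgpt_scores, human_mean)
print(t_stat, dof)
```

Converting the t statistic to a p-value requires the t distribution, which the Python standard library does not provide; a statistics package (for example, `scipy.stats.ttest_ind`) would typically be used for that step.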
A one-way ANOVA with a Tukey post-hoc analysis was utilized to determine whether the ChatGPT-calculated DISCERN scores were statistically different between articles from academic versus private sources.
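For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be sketched as below. The group sizes mirror the four private and five academic/news articles in this study, but the scores themselves are hypothetical placeholders chosen for easy arithmetic, not study data. The function name is invented for this example, and the Tukey post-hoc step is not implemented here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# HYPOTHETICAL total DISCERN scores; NOT the values from the study.
private_sources = [40.0, 42.0, 44.0, 42.0]
academic_news_sources = [50.0, 52.0, 48.0, 50.0, 50.0]

f_stat, df_b, df_w = one_way_anova_f([private_sources, academic_news_sources])
print(f_stat, df_b, df_w)
```

A large F statistic indicates that the group means differ by more than the within-group scatter would suggest; the corresponding p-value comes from the F distribution with the two degrees of freedom returned above.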

Results
Overall, ChatGPT was able to successfully evaluate the quality of the nine texts using the 16-question DISCERN tool. Additionally, ChatGPT was able to successfully evaluate the readability of the nine texts using seven different readability scales. Although ChatGPT often generated different DISCERN scores for the same article and question upon repeat query, the DISCERN scores did not differ significantly among the three queries performed by ChatGPT (p=0.94), indicating strong reliability and concordance between queries and suggesting that ChatGPT's answers are precise.
Six out of 16 (37.5%) DISCERN questions (questions 5, 9, 10, 11, 12, and 15) yielded scores that were statistically different between AI and human reviewers (p<0.05). For five of these six questions (83.3%), ChatGPT rated the quality of the text significantly higher than the three human reviewers did (p<0.05) (Table 1). This suggests ChatGPT's performance in assessing the quality of textual content was inadequate, and it did not demonstrate consistency with the standards set by human evaluators.
Readability values in all seven indices (Flesch-Kincaid level, Gunning fog index, Coleman-Liau index, SMOG index, Automated readability index, FORCAST grade level, Flesch reading ease) were significantly different when comparing Readable.com to ChatGPT (p<0.05). ChatGPT consistently determined readability to be at a lower reading level than the values determined by Readable.com (Table 2). The findings suggest a lack of concurrence between ChatGPT's algorithm and that of reputable tools, such as Readable.com.
In our previous study, we determined a significant difference in DISCERN scores between articles from private clinics versus academic or news websites (p<0.001) [3], suggesting that online information from private clinics was more biased and misleading. The analyzed sample comprised nine articles, of which four originated from private clinics and medical practices and five from academic institutions or news outlets. In this study, ChatGPT's DISCERN scores did not differ between private and academic/news sources (p=0.167), suggesting that ChatGPT was unable to accurately identify poor-quality information sources regarding shock wave therapy as a treatment for erectile dysfunction (Table 3).

Discussion
The incorporation of AI (Figure 1) into the healthcare industry is seemingly inevitable. AI may play a role in various healthcare settings, including clinical practice, research investigation, patient education, peer review [14], and surgical technology [13][14][15]. Specifically, ChatGPT has become a powerful tool for both patients and physicians in acquiring and distributing information. It is our responsibility as academic urologists to investigate the role of emerging AI technology to determine how we can best enhance patient care and research endeavors. Given the growing volume of medical content online and on social media platforms, there is a growing need for efficient and efficacious ways of evaluating the quality and readability of such medical content to ensure patients are receiving appropriate information. To our knowledge, this study represents the first to specifically compare the performance of ChatGPT to human reviewers in evaluating the quality of online medical content, particularly that related to men's health. Continued research is warranted to understand the implications of ChatGPT's role in evaluating online medical content. If ChatGPT proves to be efficacious and accurate, it may lead to the development of AI plugins that incorporate instruments such as the DISCERN tool and are geared toward enhancing patient understanding of complex medical information across social media platforms and web browsers.
Restorative therapies for erectile dysfunction have attracted significant interest and research attention within men's health, with multiple ongoing clinical trials and many unanswered questions [16,17]. A high prevalence of poor-quality information online poses a particular challenge to today's urologists [16,17]. We theorized that AI may be able to efficiently assess and discern credible information among online text sources with similar efficacy as human reviewers. Our study findings indicate that ChatGPT may not yet be able to complete this task as well as humans. Specifically, ChatGPT generated inconsistent DISCERN scores upon repeat queries, overstated information quality compared with human reviewers, and reported lower reading levels than Readable.com. Nonetheless, all queries performed with ChatGPT were successful, quick, and efficient, suggesting that there is a strong potential for AI to improve with continued exposure to content. Perhaps AI will eventually match or surpass the ability of human reviewers to evaluate online medical content.
It remains unclear why there exists a discrepancy between Readable.com versus AI rating of readability. The effectiveness of classifying a text into readability levels depends on various factors, including the selection of the dataset, the choice of algorithm, and the selection of features to be extracted from the text [18]. AI algorithms may possess unique qualities that distinguish them from conventional readability tools that have been in use for a significant period.
This study is not without limitations. Given the unverified and automated nature of the ChatGPT platform, it may be difficult to draw conclusions regarding readability. Furthermore, given that this was a cross-sectional study, we understand that the search results yielded are based on a specific point in time and do not reflect the identical search circumstances of each patient. However, despite these limitations, the present data shed light on the potential role and feasibility of AI's ability to evaluate the quality and readability of online medical content.

Conclusions
ChatGPT is a promising AI tool that can be utilized to evaluate the quality and readability of online medical text. However, in its current state, it does so with less efficacy when compared to that of human reviewers and readability assessment tools, such as Readable.com.
ChatGPT's evaluation of current articles from websites regarding shockwave therapy as a treatment for erectile dysfunction generates overstated quality metrics and readability scores at a lower reading level than that of other online readability assessment tools. Thus, more investigation is warranted both to optimize ChatGPT's medical content evaluation capabilities and to elucidate its role in enhancing patient access to high-quality information. Incorporating DISCERN tools and other established reading assessment tools into future ChatGPT software may be necessary to enhance its ability to evaluate the quality and readability of textual content and thus improve the efficacy of AI in this domain.

Additional Information
Disclosures
Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.

Conflicts of interest:
In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: This work was supported by the National Institutes of Health Grant R01 DK130991 and the Clinician Scientist Development Grant from the American Cancer Society to RR. Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.